{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vOJEA7LrHWWq"
   },
   "source": [
    "# Design Criteria for Sequence Modeling?\n",
    "\n",
    "In order to model sequences, we need to:\n",
    "- Handle variable-length sequences\n",
    "- Track long-term dependencies\n",
    "- Maintain information about order\n",
    "- Share parameters across the sequence\n",
    "\n",
    "**Recurrent Neural Network (RNN) meet these sequence modeling design criteria.**\n",
    "\n",
    "# Applications of RNN\n",
    "\n",
    "- Many to One - Sentiment Classification\n",
    "- One to Many - Text Generation, Image Captioning\n",
    "- Many to Many - Machine Translation, Forecasting, Music Generation\n",
    "\n",
    "# RNN Issues\n",
    "\n",
    "- Exploding Gradients \n",
    "  - Use gradient clipping to scale big gradients\n",
    "- Problem of Long Term dependencies because of vanishing gradient\n",
    "  - Weight Initialization - Initialize weights to identity matrix\n",
    "  - Network Architecture - Use gated cells like LSTMs or GRUs. These architecture rely on a gated cell to track information throughout many time steps\n",
    "  - Activation Function - Using ReLU prevents derivative from shrinking the gradient when x > 0\n",
    "\n",
    "# Limitations of Recurrent Models\n",
    "\n",
    "- Encoding bottleneck\n",
    "- Slow - No parallelization\n",
    "- Not long memory\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ekl_qorkn303"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "S08myYF3oQ5u"
   },
   "source": [
    "## Read the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "5OKGxZki2gVi",
    "outputId": "0f61101c-94ef-4e04-f352-3c9ec7e2053a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mounted at /content/drive\n"
     ]
    }
   ],
   "source": [
    "from google.colab import drive\n",
    "drive.mount('/content/drive')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "bBcjJVIYoO8O",
    "outputId": "0c70b540-dc87-4a3e-ede2-7bf143cb184f"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "  <div id=\"df-23742df7-6f87-41c2-95a3-0d33200fa44a\">\n",
       "    <div class=\"colab-df-container\">\n",
       "      <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>label</th>\n",
       "      <th>text</th>\n",
       "      <th>label_num</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>605</td>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: enron methanol ; meter # : 988291\\r\\n...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2349</td>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: hpl nom for january 9 , 2001\\r\\n( see...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3624</td>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: neon retreat\\r\\nho ho ho , we ' re ar...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4685</td>\n",
       "      <td>spam</td>\n",
       "      <td>Subject: photoshop , windows , office . cheap ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2030</td>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: re : indian springs\\r\\nthis deal is t...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-23742df7-6f87-41c2-95a3-0d33200fa44a')\"\n",
       "              title=\"Convert this dataframe to an interactive table.\"\n",
       "              style=\"display:none;\">\n",
       "        \n",
       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
       "       width=\"24px\">\n",
       "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
       "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
       "  </svg>\n",
       "      </button>\n",
       "      \n",
       "  <style>\n",
       "    .colab-df-container {\n",
       "      display:flex;\n",
       "      flex-wrap:wrap;\n",
       "      gap: 12px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert {\n",
       "      background-color: #E8F0FE;\n",
       "      border: none;\n",
       "      border-radius: 50%;\n",
       "      cursor: pointer;\n",
       "      display: none;\n",
       "      fill: #1967D2;\n",
       "      height: 32px;\n",
       "      padding: 0 0 0 0;\n",
       "      width: 32px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert:hover {\n",
       "      background-color: #E2EBFA;\n",
       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "      fill: #174EA6;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert {\n",
       "      background-color: #3B4455;\n",
       "      fill: #D2E3FC;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert:hover {\n",
       "      background-color: #434B5C;\n",
       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
       "      fill: #FFFFFF;\n",
       "    }\n",
       "  </style>\n",
       "\n",
       "      <script>\n",
       "        const buttonEl =\n",
       "          document.querySelector('#df-23742df7-6f87-41c2-95a3-0d33200fa44a button.colab-df-convert');\n",
       "        buttonEl.style.display =\n",
       "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "\n",
       "        async function convertToInteractive(key) {\n",
       "          const element = document.querySelector('#df-23742df7-6f87-41c2-95a3-0d33200fa44a');\n",
       "          const dataTable =\n",
       "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
       "                                                     [key], {});\n",
       "          if (!dataTable) return;\n",
       "\n",
       "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
       "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
       "            + ' to learn more about interactive tables.';\n",
       "          element.innerHTML = '';\n",
       "          dataTable['output_type'] = 'display_data';\n",
       "          await google.colab.output.renderOutput(dataTable, element);\n",
       "          const docLink = document.createElement('div');\n",
       "          docLink.innerHTML = docLinkHtml;\n",
       "          element.appendChild(docLink);\n",
       "        }\n",
       "      </script>\n",
       "    </div>\n",
       "  </div>\n",
       "  "
      ],
      "text/plain": [
       "   Unnamed: 0 label                                               text  \\\n",
       "0         605   ham  Subject: enron methanol ; meter # : 988291\\r\\n...   \n",
       "1        2349   ham  Subject: hpl nom for january 9 , 2001\\r\\n( see...   \n",
       "2        3624   ham  Subject: neon retreat\\r\\nho ho ho , we ' re ar...   \n",
       "3        4685  spam  Subject: photoshop , windows , office . cheap ...   \n",
       "4        2030   ham  Subject: re : indian springs\\r\\nthis deal is t...   \n",
       "\n",
       "   label_num  \n",
       "0          0  \n",
       "1          0  \n",
       "2          0  \n",
       "3          1  \n",
       "4          0  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/text/spam_ham/email_data.csv')\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "drRowQ-robmu",
    "outputId": "bbd4a1db-8b0d-41ee-ed97-a13a52e10938"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "  <div id=\"df-21a8d3a6-2a74-4ca0-a741-8abab2b3cd35\">\n",
       "    <div class=\"colab-df-container\">\n",
       "      <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>text</th>\n",
       "      <th>label_num</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: enron methanol ; meter # : 988291\\r\\n...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: hpl nom for january 9 , 2001\\r\\n( see...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: neon retreat\\r\\nho ho ho , we ' re ar...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>spam</td>\n",
       "      <td>Subject: photoshop , windows , office . cheap ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ham</td>\n",
       "      <td>Subject: re : indian springs\\r\\nthis deal is t...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-21a8d3a6-2a74-4ca0-a741-8abab2b3cd35')\"\n",
       "              title=\"Convert this dataframe to an interactive table.\"\n",
       "              style=\"display:none;\">\n",
       "        \n",
       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
       "       width=\"24px\">\n",
       "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
       "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
       "  </svg>\n",
       "      </button>\n",
       "      \n",
       "  <style>\n",
       "    .colab-df-container {\n",
       "      display:flex;\n",
       "      flex-wrap:wrap;\n",
       "      gap: 12px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert {\n",
       "      background-color: #E8F0FE;\n",
       "      border: none;\n",
       "      border-radius: 50%;\n",
       "      cursor: pointer;\n",
       "      display: none;\n",
       "      fill: #1967D2;\n",
       "      height: 32px;\n",
       "      padding: 0 0 0 0;\n",
       "      width: 32px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert:hover {\n",
       "      background-color: #E2EBFA;\n",
       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "      fill: #174EA6;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert {\n",
       "      background-color: #3B4455;\n",
       "      fill: #D2E3FC;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert:hover {\n",
       "      background-color: #434B5C;\n",
       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
       "      fill: #FFFFFF;\n",
       "    }\n",
       "  </style>\n",
       "\n",
       "      <script>\n",
       "        const buttonEl =\n",
       "          document.querySelector('#df-21a8d3a6-2a74-4ca0-a741-8abab2b3cd35 button.colab-df-convert');\n",
       "        buttonEl.style.display =\n",
       "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "\n",
       "        async function convertToInteractive(key) {\n",
       "          const element = document.querySelector('#df-21a8d3a6-2a74-4ca0-a741-8abab2b3cd35');\n",
       "          const dataTable =\n",
       "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
       "                                                     [key], {});\n",
       "          if (!dataTable) return;\n",
       "\n",
       "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
       "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
       "            + ' to learn more about interactive tables.';\n",
       "          element.innerHTML = '';\n",
       "          dataTable['output_type'] = 'display_data';\n",
       "          await google.colab.output.renderOutput(dataTable, element);\n",
       "          const docLink = document.createElement('div');\n",
       "          docLink.innerHTML = docLinkHtml;\n",
       "          element.appendChild(docLink);\n",
       "        }\n",
       "      </script>\n",
       "    </div>\n",
       "  </div>\n",
       "  "
      ],
      "text/plain": [
       "  label                                               text  label_num\n",
       "0   ham  Subject: enron methanol ; meter # : 988291\\r\\n...          0\n",
       "1   ham  Subject: hpl nom for january 9 , 2001\\r\\n( see...          0\n",
       "2   ham  Subject: neon retreat\\r\\nho ho ho , we ' re ar...          0\n",
       "3  spam  Subject: photoshop , windows , office . cheap ...          1\n",
       "4   ham  Subject: re : indian springs\\r\\nthis deal is t...          0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = df.drop('Unnamed: 0', axis=1)\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "rstIiLobopit"
   },
   "outputs": [],
   "source": [
    "y = df['label_num']\n",
    "X = df[['text']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "XN_dBgnVoq6M"
   },
   "outputs": [],
   "source": [
    "# Splitting into train and test\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "sya-K-TbqrER",
    "outputId": "e0e1996b-062a-40cc-9b8f-dd3e2359fb63"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "  <div id=\"df-9a5e4ecb-277f-49cd-bf88-3f7697154b17\">\n",
       "    <div class=\"colab-df-container\">\n",
       "      <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5132</th>\n",
       "      <td>Subject: april activity surveys\\r\\nwe are star...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2067</th>\n",
       "      <td>Subject: message subject\\r\\nhey i ' am julie ^...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4716</th>\n",
       "      <td>Subject: txu fuels / sds nomination for may 20...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4710</th>\n",
       "      <td>Subject: re : richardson volumes nov 99 and de...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2268</th>\n",
       "      <td>Subject: a new era of online medical care .\\r\\...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-9a5e4ecb-277f-49cd-bf88-3f7697154b17')\"\n",
       "              title=\"Convert this dataframe to an interactive table.\"\n",
       "              style=\"display:none;\">\n",
       "        \n",
       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
       "       width=\"24px\">\n",
       "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
       "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
       "  </svg>\n",
       "      </button>\n",
       "      \n",
       "  <style>\n",
       "    .colab-df-container {\n",
       "      display:flex;\n",
       "      flex-wrap:wrap;\n",
       "      gap: 12px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert {\n",
       "      background-color: #E8F0FE;\n",
       "      border: none;\n",
       "      border-radius: 50%;\n",
       "      cursor: pointer;\n",
       "      display: none;\n",
       "      fill: #1967D2;\n",
       "      height: 32px;\n",
       "      padding: 0 0 0 0;\n",
       "      width: 32px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert:hover {\n",
       "      background-color: #E2EBFA;\n",
       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "      fill: #174EA6;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert {\n",
       "      background-color: #3B4455;\n",
       "      fill: #D2E3FC;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert:hover {\n",
       "      background-color: #434B5C;\n",
       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
       "      fill: #FFFFFF;\n",
       "    }\n",
       "  </style>\n",
       "\n",
       "      <script>\n",
       "        const buttonEl =\n",
       "          document.querySelector('#df-9a5e4ecb-277f-49cd-bf88-3f7697154b17 button.colab-df-convert');\n",
       "        buttonEl.style.display =\n",
       "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "\n",
       "        async function convertToInteractive(key) {\n",
       "          const element = document.querySelector('#df-9a5e4ecb-277f-49cd-bf88-3f7697154b17');\n",
       "          const dataTable =\n",
       "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
       "                                                     [key], {});\n",
       "          if (!dataTable) return;\n",
       "\n",
       "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
       "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
       "            + ' to learn more about interactive tables.';\n",
       "          element.innerHTML = '';\n",
       "          dataTable['output_type'] = 'display_data';\n",
       "          await google.colab.output.renderOutput(dataTable, element);\n",
       "          const docLink = document.createElement('div');\n",
       "          docLink.innerHTML = docLinkHtml;\n",
       "          element.appendChild(docLink);\n",
       "        }\n",
       "      </script>\n",
       "    </div>\n",
       "  </div>\n",
       "  "
      ],
      "text/plain": [
       "                                                   text\n",
       "5132  Subject: april activity surveys\\r\\nwe are star...\n",
       "2067  Subject: message subject\\r\\nhey i ' am julie ^...\n",
       "4716  Subject: txu fuels / sds nomination for may 20...\n",
       "4710  Subject: re : richardson volumes nov 99 and de...\n",
       "2268  Subject: a new era of online medical care .\\r\\..."
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "1PZbKo4VqrHn",
    "outputId": "13d475ef-9b71-4391-cbf8-2331fd259eeb"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(4136, 1) (1035, 1)\n"
     ]
    }
   ],
   "source": [
    "print(X_train.shape, X_test.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "GNtP_96spFhn",
    "outputId": "31d6f2b4-c636-40e2-b9fe-26b9c5e29e3f"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.8.2\n",
      "2.8.0\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "from tensorflow import keras\n",
    "\n",
    "print(tf.__version__)\n",
    "\n",
    "print(keras.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "CLPbMip_pJ83"
   },
   "outputs": [],
   "source": [
    "from keras.utils import np_utils\n",
    "from keras.models import Sequential \n",
    "from keras.layers import Dense, Dropout, BatchNormalization, Flatten"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "BcmZbpiwpMoc"
   },
   "outputs": [],
   "source": [
    "from keras.layers import Conv2D, MaxPooling2D\n",
    "from keras.layers.recurrent import SimpleRNN, LSTM, GRU\n",
    "from keras.layers import Bidirectional, Embedding\n",
    "from keras.preprocessing import sequence, text\n",
    "from keras.callbacks import EarlyStopping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mMiXERzg0JJ5"
   },
   "source": [
    "Credits -  \n",
    "https://dzlab.github.io/dltips/en/keras/keras-text-preprocessing/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CZ5-osM3zo1s"
   },
   "source": [
    "## Preprocessing\n",
    "\n",
    "1. Tokenization\n",
    "\n",
    "- Use `fit_on_texts` to update the tokenizer internal vocabulary based on a list of texts.   \n",
    "Updates internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency. So if you give it something like, \"The cat sat on the mat.\" It will create a dictionary s.t. `word_index[\"the\"] = 1; word_index[\"cat\"] = 2` it is word -> index dictionary so every word gets a unique integer value. 0 is reserved for padding. So lower integer means more frequent word (often the first few are stop words because they appear a lot).\n",
    "- Use `fit_on_sequences` to update the tokenizer internal vocabulary based on a list of sequences.\n",
    "\n",
    "2. Numericalization\n",
    "\n",
    "- Use `texts_to_sequences` to transforms each string in a list of strings to sequence of integers  \n",
    "Transforms each text in texts to a sequence of integers. So it basically takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary. Nothing more, nothing less, certainly no magic involved.\n",
    "- Use `sequences_to_matrix` to convert a list of sequences into a Numpy matrix\n",
    "\n",
    "\n",
    "3. Sequence Padding\n",
    "- You can use `pad_sequences` to add padding to your data so that the result would have same format.\n",
    "\n",
    "\n",
    "# If RNN can work with variable length sequence, then why `PADDING`?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "dQgb1oqszib9"
   },
   "outputs": [],
   "source": [
    "token = text.Tokenizer()\n",
    "\n",
    "X_train_values = X_train['text'].tolist()\n",
    "\n",
    "token.fit_on_texts(list(X_train_values))\n",
    "\n",
    "X_train_seq = token.texts_to_sequences(X_train_values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "7dihT6qK4bYC",
    "outputId": "5574ab0f-8acf-421f-ca08-9ba84b5d8c12"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5916\n"
     ]
    }
   ],
   "source": [
    "max_seq_len = 0\n",
    "for seq in X_train_seq:\n",
    "  if len(seq) > max_seq_len:\n",
    "    max_seq_len = len(seq)\n",
    "\n",
    "print(max_seq_len)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "R9UiKu7E1OmL",
    "outputId": "d1bce78c-5a4e-4515-f3fe-d848384444f8"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(4136, 600)\n",
      "(600,)\n",
      "51762\n"
     ]
    }
   ],
   "source": [
    "max_len = 600\n",
    "\n",
    "X_train_pad = sequence.pad_sequences(X_train_seq, maxlen=max_len)\n",
    "\n",
    "print(X_train_pad.shape)\n",
    "\n",
    "print(X_train_pad[1].shape)\n",
    "\n",
    "print(len(token.word_index))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "eX8-Nzb5ktLz",
    "outputId": "937d1b6d-c378-4f93-8bd2-1421e8e2be1e"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('\\r', 1),\n",
       " ('the', 2),\n",
       " ('to', 3),\n",
       " ('and', 4),\n",
       " ('ect', 5),\n",
       " ('for', 6),\n",
       " ('of', 7),\n",
       " ('a', 8),\n",
       " (\"'\", 9),\n",
       " ('subject', 10)]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(token.word_index.items())[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "NVzj-aMk2s12",
    "outputId": "72285219-bbcb-4568-83f2-55e8ef7637a2"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"sequential\"\n",
      "_________________________________________________________________\n",
      " Layer (type)                Output Shape              Param #   \n",
      "=================================================================\n",
      " embedding (Embedding)       (None, 600, 32)           1656416   \n",
      "                                                                 \n",
      " simple_rnn (SimpleRNN)      (None, 100)               13300     \n",
      "                                                                 \n",
      " dense (Dense)               (None, 1)                 101       \n",
      "                                                                 \n",
      "=================================================================\n",
      "Total params: 1,669,817\n",
      "Trainable params: 1,669,817\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n",
      "None\n",
      "Epoch 1/5\n",
      "65/65 [==============================] - 28s 361ms/step - loss: 0.5571 - accuracy: 0.7394\n",
      "Epoch 2/5\n",
      "65/65 [==============================] - 23s 360ms/step - loss: 0.1927 - accuracy: 0.9640\n",
      "Epoch 3/5\n",
      "65/65 [==============================] - 23s 360ms/step - loss: 1.3799 - accuracy: 0.7981\n",
      "Epoch 4/5\n",
      "65/65 [==============================] - 23s 360ms/step - loss: 0.2838 - accuracy: 0.8946\n",
      "Epoch 5/5\n",
      "65/65 [==============================] - 24s 365ms/step - loss: 0.2775 - accuracy: 0.8900\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x7fb380cb8750>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "number_of_top_words = len(token.word_index)\n",
    "embedding_vector_length = 32\n",
    "\n",
    "# STEP-1 Define the Model\n",
    "model = None\n",
    "model = Sequential()\n",
    "\n",
    "# Embedding Layer\n",
    "model.add(Embedding(input_dim=number_of_top_words + 1, \n",
    "                    output_dim=embedding_vector_length, \n",
    "                    input_length=max_len))\n",
    "\n",
    "# SimpleRNN-100\n",
    "model.add(SimpleRNN(100))\n",
    "\n",
    "# FC-1\n",
    "model.add(Dense(1, activation='relu', kernel_initializer='he_normal'))\n",
    "\n",
    "\n",
    "# STEP-2 Compile the Model\n",
    "model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])\n",
    "\n",
    "print(model.summary())\n",
    "\n",
    "# STEP-3 Fit the Model\n",
    "model.fit(X_train_pad, y_train, batch_size=64, epochs=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "zM8v5f8ViOJ6",
    "outputId": "02bb34ee-92e9-4efb-9ee1-a2520ffac749"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1035, 600)\n",
      "(600,)\n"
     ]
    }
   ],
   "source": [
    "X_test_values = X_test['text'].tolist()\n",
    "\n",
    "X_test_seq = token.texts_to_sequences(X_test_values)\n",
    "\n",
    "X_test_pad = sequence.pad_sequences(X_test_seq, maxlen=max_len)\n",
    "\n",
    "print(X_test_pad.shape)\n",
    "\n",
    "print(X_test_pad[1].shape)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "TQfgr4ZScCTN",
    "outputId": "2b41725c-3762-4117-f6a7-d562f0459ad1"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test Loss and Test Accuracy [0.30551135540008545, 0.8599033951759338]\n"
     ]
    }
   ],
   "source": [
    "prediction_score = model.evaluate(X_test_pad, y_test, verbose=0)\n",
    "\n",
    "print('Test Loss and Test Accuracy', prediction_score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5c5ELU-bH8aH"
   },
   "source": [
    "# LSTM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "wa2C-wOc9dd5",
    "outputId": "49a1b95d-db13-4138-d6cc-bfc95b99c8ba"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"sequential_1\"\n",
      "_________________________________________________________________\n",
      " Layer (type)                Output Shape              Param #   \n",
      "=================================================================\n",
      " embedding_1 (Embedding)     (None, 600, 32)           1656416   \n",
      "                                                                 \n",
      " lstm (LSTM)                 (None, 100)               53200     \n",
      "                                                                 \n",
      " dense_1 (Dense)             (None, 1)                 101       \n",
      "                                                                 \n",
      "=================================================================\n",
      "Total params: 1,709,717\n",
      "Trainable params: 1,709,717\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n",
      "None\n",
      "Epoch 1/5\n",
      "65/65 [==============================] - 158s 2s/step - loss: 0.5827 - accuracy: 0.7466\n",
      "Epoch 2/5\n",
      "65/65 [==============================] - 154s 2s/step - loss: 0.2678 - accuracy: 0.8361\n",
      "Epoch 3/5\n",
      "65/65 [==============================] - 154s 2s/step - loss: 0.1537 - accuracy: 0.9197\n",
      "Epoch 4/5\n",
      "65/65 [==============================] - 153s 2s/step - loss: 0.0881 - accuracy: 0.9729\n",
      "Epoch 5/5\n",
      "65/65 [==============================] - 152s 2s/step - loss: 0.0375 - accuracy: 0.9923\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x7fb3022a52d0>"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "number_of_top_words = len(token.word_index)\n",
    "embedding_vector_length = 32\n",
    "\n",
    "# STEP-1 Define the Model\n",
    "model = None\n",
    "model = Sequential()\n",
    "\n",
    "# Embedding Layer\n",
    "model.add(Embedding(input_dim=number_of_top_words + 1, \n",
    "                    output_dim=embedding_vector_length, \n",
    "                    input_length=max_len))\n",
    "\n",
    "# LSTM-100\n",
    "model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))\n",
    "\n",
    "# FC-1\n",
    "model.add(Dense(1, activation='relu', kernel_initializer='he_normal'))\n",
    "\n",
    "\n",
    "# STEP-2 Compile the Model\n",
    "model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])\n",
    "\n",
    "print(model.summary())\n",
    "\n",
    "# STEP-3 Fit the Model\n",
    "model.fit(X_train_pad, y_train, batch_size=64, epochs=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Kf9gn6vg9dhA",
    "outputId": "78793174-46bc-4506-dbbf-0f1f4b91eec9"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test Loss and Test Accuracy [0.11802364140748978, 0.9681159257888794]\n"
     ]
    }
   ],
   "source": [
    "prediction_score = model.evaluate(X_test_pad, y_test, verbose=0)\n",
    "\n",
    "print('Test Loss and Test Accuracy', prediction_score)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "machine_shape": "hm",
   "provenance": []
  },
  "gpuClass": "standard",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
